Tony Li
General Architecture of
Cloud IaaS
Brief
What is cloud computing?
http://en.wikipedia.org/wiki/Cloud_computing
This doc is about
The general, publicly known architecture and implementation of the cloud IaaS layer.
Core issues and techniques
1. Deploy resources and obtain services quickly, based on virtualization technology (virtualization)
2. Scale dynamically and flexibly (dynamic and elastic)
3. Request resources on demand and pay according to the quantity used (pay per use)
4. Deliver services over the network and process massive amounts of information (distributed computation and storage)
Virtualization
Keywords: libvirt, KVM, DHCP, VLAN
Virtualization of physical machines
KVM, VMware, Xen, Hyper-V: which one to choose?
KVM is an open source virtualization module in the Linux kernel.
KVM needs hardware virtualization support (Intel VT-x or AMD-V).
libvirt is an open source C library and toolkit for managing virtualization on Linux.
A general strategy for physical machine virtualization is to call libvirt via JNI; libvirt, in turn, invokes Linux commands to operate KVM.
Virtualization of network
DHCP: IP address management.
vSwitch: virtual switch.
VLAN: virtual LAN.
A general strategy for network virtualization is to create a VLAN with a vSwitch, so that users in different VLANs cannot reach each other even if they are on the same physical LAN.
Dynamic and elastic
Keywords: LVS, Keepalived
LVS
http://www.linuxvirtualserver.org/how.html
http://www.linuxvirtualserver.org/whatis.html
Keepalived
Note: LVS and Keepalived are software packages, not techniques.
See /etc/keepalived/keepalived.conf
and /etc/init.d/keepalived on your Linux machine.
A general strategy for the load balancer (LVS)
1. A virtual server (VS) is neither a virtual machine (VM) nor a physical machine; it is a virtual server running on a virtual machine.
2. A VS dispatches requests to many real servers, which provide the actual services. The VS holds no business logic; it only handles request dispatching.
3. Create a VM before creating a VS. There is at most one VS per VM.
4. Real servers that belong to the same VS do NOT know of each other's existence.
5. Use Keepalived to monitor the load on the real server cluster.
6. Automatically add or delete real servers according to the load, to keep it balanced.
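The add/delete rule in the last two points can be sketched as a small decision function. This is only an illustration under assumptions: the class name AutoScaler, the load thresholds and the server limits are hypothetical and do not come from the original system.

```java
import java.util.List;

/** Hypothetical sketch: decides whether a virtual server needs more or fewer real servers. */
final class AutoScaler {
    private final double highLoad;  // add a real server when the average load exceeds this (0..1)
    private final double lowLoad;   // remove a real server when the average load drops below this (0..1)
    private final int minServers;   // never shrink below this many real servers
    private final int maxServers;   // never grow beyond this many real servers

    AutoScaler(double highLoad, double lowLoad, int minServers, int maxServers) {
        this.highLoad = highLoad;
        this.lowLoad = lowLoad;
        this.minServers = minServers;
        this.maxServers = maxServers;
    }

    /** Returns +1 to add a real server, -1 to delete one, 0 to leave the pool unchanged. */
    int decide(List<Double> loads) {
        if (loads.isEmpty()) {
            return minServers > 0 ? 1 : 0;
        }
        double sum = 0.0;
        for (double load : loads) {
            sum += load;
        }
        double average = sum / loads.size();
        if (average > highLoad && loads.size() < maxServers) {
            return 1;
        }
        if (average < lowLoad && loads.size() > minServers) {
            return -1;
        }
        return 0;
    }
}
```

Using two separate thresholds (for example 0.8 and 0.3) leaves a dead band between them, which keeps the pool from flapping when the load hovers around a single value.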
A general strategy for billing:
For VMs, VSs and image storage, cost = product of the special weights * price * time.
Use more, pay more;
use less, pay less;
use it, pay for it;
do not use it, do not pay.
Trick: for image storage, the larger the image, the lower the unit price.
Special weights
Weight of a VM: depends on the template (normal, standard, enhanced, ...) and the VM status (suspended, running, default, ...).
Weight of a VS: depends on the template and the number of real servers.
Weight of an image: depends on the image size.
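As a minimal sketch of the cost formula, assuming the special weights are simply multiplied together and time is measured in hours (the class name Billing and all numbers are illustrative, not taken from any real price list):

```java
/** Hypothetical sketch of cost = product of the special weights * price * time. */
final class Billing {
    /**
     * @param weights      special weights, e.g. template weight and status weight
     * @param pricePerHour base price per hour
     * @param hours        usage time in hours
     */
    static double cost(double[] weights, double pricePerHour, double hours) {
        double weight = 1.0;
        for (double w : weights) {
            weight *= w;
        }
        return weight * pricePerHour * hours;
    }
}
```

For example, a VM with an assumed template weight of 2.0 and an assumed "suspended" status weight of 0.5, at a base price of 0.10 per hour for 10 hours, costs 2.0 * 0.5 * 0.10 * 10 = 1.0.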
Example of Amazon EC2
Pay only for what you use
Distributed storage
What is distributed storage
A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion.
Disadvantages of HDFS
Single point of failure at the master node (NameNode).
Usage of MySQL Cluster
Overview of the IaaS layer manager
Eucalyptus structure
Cloud
Manage resource information and handle user requests
Communicate with database
Cluster
Manage physical machines, virtual networks, load balancing and auto scaling
Communicate with memory
Node
Control the physical machine, DNS and LVS
Create, delete and migrate virtual machines
Control load balancing
Communicate with the physical machine
[Diagram: cloud, cluster, node hierarchy]
Node 1: communicating with the OS
Generally, nodes communicate with the OS in two ways.
Invoke the shell (on *nix) or cmd (on Windows)
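import java.io.IOException;

// Runs cmd through the platform shell, collects its output and returns its exit code (-1 if the OS is unknown).
public static int execEx(String cmd, StringBuffer stdout, StringBuffer stderr)
        throws IOException, InterruptedException {
    String osName = System.getProperty("os.name");
    if (osName == null)
        return -1;
    String[] shellCmd = new String[3];
    if (osName.contains("Windows")) {
        shellCmd[0] = "cmd.exe";
        shellCmd[1] = "/c";
    } else {
        shellCmd[0] = "/bin/sh";
        shellCmd[1] = "-c";
    }
    shellCmd[2] = cmd;
    Process process = Runtime.getRuntime().exec(shellCmd);
    // Suitable for small outputs only: reading the two streams one after
    // the other can block if the command fills the stderr pipe first.
    stdout.append(new String(process.getInputStream().readAllBytes()));
    stderr.append(new String(process.getErrorStream().readAllBytes()));
    return process.waitFor();
}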
Invoke the libvirt API
import org.libvirt.Connect;
// The second argument is the read-only flag; false opens a read-write connection.
Connect conn = new Connect("qemu:///system", false);
conn.domainCreateXML(xmlDomain, 0);
Node 2 virtual machine
A node creates a virtual machine
Step 1: Create a gateway if this node does not have one yet
The *nix shell commands that will be used:
For the gateway
ip link show, vconfig add, brctl show, brctl addif, brctl addbr, ip link set, ip route add
For DHCP
lsof -iUDP:67
String cmd= cmdPath +" -s 127.0.0.1 -P 6767 " + paramIfaces;
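The gateway commands listed under Step 1 could be assembled roughly as follows. This is a sketch under assumptions: the class name VlanBridgeCommands and the bridge naming scheme "br<vlanId>" are hypothetical, and it assumes vconfig's usual "<interface>.<vlanId>" naming for the VLAN interface. The strings would then be run through the node's shell helper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds the shell commands that put a VLAN interface behind a bridge. */
final class VlanBridgeCommands {
    static List<String> build(String physicalInterface, int vlanId) {
        if (vlanId < 1 || vlanId > 4094) {
            throw new IllegalArgumentException("VLAN id must be in 1..4094: " + vlanId);
        }
        String vlanInterface = physicalInterface + "." + vlanId; // e.g. eth0.100
        String bridge = "br" + vlanId;                           // assumed naming scheme
        List<String> commands = new ArrayList<>();
        commands.add("vconfig add " + physicalInterface + " " + vlanId); // create the VLAN interface
        commands.add("brctl addbr " + bridge);                           // create the bridge
        commands.add("brctl addif " + bridge + " " + vlanInterface);     // attach the VLAN interface
        commands.add("ip link set " + vlanInterface + " up");
        commands.add("ip link set " + bridge + " up");
        return commands;
    }
}
```

VMs attached to the resulting bridge only see traffic tagged with that VLAN id, which is what keeps users in different VLANs apart on the same physical LAN.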
Step 2: Start up a VM
Copy image
Copy the specified image from the distributed file storage system.
Invoke libvirt to start the VM
libvirt needs an XML document that describes the virtual machine settings
http://blog.csdn.net/whuqin/article/details/6732898
General sample code:
void createInstance(VirtualMachine vm) throws LibvirtException {
    // Build the libvirt domain XML that describes this VM
    String xmlDomain = generateDomainXML(vm);
    Connect conn = new Connect("qemu:///system", false);
    conn.domainCreateXML(xmlDomain, 0);
}
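The sample code relies on a generateDomainXML helper that is not shown. A minimal sketch of what such a helper might produce is given below. The class name DomainXml and its parameters are hypothetical, the image is assumed to be a qcow2 file attached as a virtio disk, and the values are inserted without XML escaping, so they must not contain quotes or angle brackets.

```java
/** Hypothetical sketch: builds a minimal libvirt domain XML for a KVM guest. */
final class DomainXml {
    static String generate(String name, long memoryKiB, int vcpus, String imagePath, String bridge) {
        return "<domain type='kvm'>\n"
             + "  <name>" + name + "</name>\n"
             + "  <memory unit='KiB'>" + memoryKiB + "</memory>\n"
             + "  <vcpu>" + vcpus + "</vcpu>\n"
             + "  <os><type arch='x86_64'>hvm</type></os>\n"
             + "  <devices>\n"
             + "    <disk type='file' device='disk'>\n"
             + "      <driver name='qemu' type='qcow2'/>\n"   // assumes a qcow2 image
             + "      <source file='" + imagePath + "'/>\n"
             + "      <target dev='vda' bus='virtio'/>\n"
             + "    </disk>\n"
             + "    <interface type='bridge'>\n"              // attach the guest to the node's bridge
             + "      <source bridge='" + bridge + "'/>\n"
             + "    </interface>\n"
             + "  </devices>\n"
             + "</domain>\n";
    }
}
```

The returned string is what would be passed to conn.domainCreateXML(xmlDomain, 0).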
Node 3 load balance
Nodes create the virtual server
Generally, Linux performs Direct Routing by rewriting the header of the Ethernet frame (the destination MAC address) and then re-sending the frame.
/etc/init.d/keepalived start
/etc/init.d/keepalived reload
/etc/keepalived/keepalived.conf is the config file.
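A node that creates a virtual server has to write a virtual_server block into that config file before reloading Keepalived. The sketch below shows one way to generate such a block for Direct Routing. The class name KeepalivedConfig is hypothetical, and the round-robin scheduler, the delay_loop value and the TCP health check are illustrative choices, not settings taken from the original system.

```java
import java.util.List;

/** Hypothetical sketch: renders a keepalived.conf virtual_server block using Direct Routing. */
final class KeepalivedConfig {
    static String virtualServer(String vip, int port, List<String> realServerIps) {
        StringBuilder sb = new StringBuilder();
        sb.append("virtual_server ").append(vip).append(' ').append(port).append(" {\n");
        sb.append("    delay_loop 6\n");   // seconds between health checks
        sb.append("    lb_algo rr\n");     // round-robin scheduling (assumed choice)
        sb.append("    lb_kind DR\n");     // Direct Routing
        sb.append("    protocol TCP\n");
        for (String ip : realServerIps) {
            sb.append("    real_server ").append(ip).append(' ').append(port).append(" {\n");
            sb.append("        weight 1\n");
            sb.append("        TCP_CHECK {\n");
            sb.append("            connect_timeout 3\n");
            sb.append("        }\n");
            sb.append("    }\n");
        }
        sb.append("}\n");
        return sb.toString();
    }
}
```

Adding or deleting a real server then amounts to regenerating the block and running /etc/init.d/keepalived reload.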
Nodes create a real server
First create a VM, then add the VM's IP and port to the pool of real servers that the virtual server dispatches to.
public static String generateRealServerCmd(long vip) {
    String strVip = Utilities.convertLong2IP(vip);
    StringBuilder cmd = new StringBuilder();
    cmd.append("VIP=").append(strVip).append("\n");
    // Bind the virtual IP to a loopback alias so the real server accepts packets addressed to it
    cmd.append("/sbin/ifconfig lo:0 $VIP broadcast $VIP netmask 255.255.255.255 up\n");
    cmd.append("/sbin/route add -host $VIP dev lo:0\n");
    cmd.append("sysctl -w net.ipv4.ip_forward=0\n");
    // Keep the real server from answering or announcing ARP for the virtual IP
    cmd.append("sysctl -w net.ipv4.conf.lo.arp_ignore=1\n");
    cmd.append("sysctl -w net.ipv4.conf.lo.arp_announce=2\n");
    cmd.append("sysctl -w net.ipv4.conf.all.arp_ignore=1\n");
    cmd.append("sysctl -w net.ipv4.conf.all.arp_announce=2\n");
    return cmd.toString();
}
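The code above calls Utilities.convertLong2IP, which is not shown. A possible implementation is sketched here, assuming the long holds an IPv4 address with the first octet in the most significant byte. The class name IpUtil and the reverse helper ipToLong are hypothetical.

```java
/** Hypothetical sketch of an IPv4 <-> long conversion (first octet in the most significant byte). */
final class IpUtil {
    static String longToIp(long ip) {
        if (ip < 0 || ip > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        return ((ip >> 24) & 0xFF) + "." + ((ip >> 16) & 0xFF) + "."
             + ((ip >> 8) & 0xFF) + "." + (ip & 0xFF);
    }

    static long ipToLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long result = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            result = (result << 8) | octet;
        }
        return result;
    }
}
```

With this byte order, 0xC0A80164L corresponds to 192.168.1.100.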
The cluster reduces the load on the db and controls the nodes at a higher level.
The cluster mainly forwards commands from the cloud to the nodes,
which includes selecting nodes, controlling nodes and supervising nodes.
Select node
When the cloud asks an idle cluster to create a virtual machine,
the cluster finds the idlest node among its nodes and starts the VM on it.
Idlest means: used CPU, used memory and used disk are combined using some weights, and the resulting value is the smallest.
Also, the node must not host a vsHost or a defaultGateway.
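The selection rule can be sketched as a weighted score over the three usage ratios. The class names NodeInfo and NodeSelector and the weights 0.5, 0.3 and 0.2 are assumptions for illustration; the original does not state the actual weights.

```java
import java.util.List;

/** Hypothetical snapshot of one node's resource usage (ratios in 0..1). */
final class NodeInfo {
    final String name;
    final double usedCpu;
    final double usedMemory;
    final double usedDisk;
    final boolean vsHost;          // node already hosts a virtual server
    final boolean defaultGateway;  // node already hosts the default gateway

    NodeInfo(String name, double usedCpu, double usedMemory, double usedDisk,
             boolean vsHost, boolean defaultGateway) {
        this.name = name;
        this.usedCpu = usedCpu;
        this.usedMemory = usedMemory;
        this.usedDisk = usedDisk;
        this.vsHost = vsHost;
        this.defaultGateway = defaultGateway;
    }
}

/** Hypothetical sketch: picks the node with the smallest weighted usage score. */
final class NodeSelector {
    static final double W_CPU = 0.5;   // assumed weights, not from the original system
    static final double W_MEM = 0.3;
    static final double W_DISK = 0.2;

    static double score(NodeInfo n) {
        return W_CPU * n.usedCpu + W_MEM * n.usedMemory + W_DISK * n.usedDisk;
    }

    /** Returns the idlest eligible node, or null if no node is eligible. */
    static NodeInfo idlest(List<NodeInfo> nodes) {
        NodeInfo best = null;
        for (NodeInfo n : nodes) {
            if (n.vsHost || n.defaultGateway) {
                continue; // excluded by the rule above
            }
            if (best == null || score(n) < score(best)) {
                best = n;
            }
        }
        return best;
    }
}
```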
Supervise node status
Nodes report their status to the cluster continually (e.g. once every 5~10 seconds); the cluster summarizes the status over a period (e.g. once every 2 minutes) and then reports to the cloud.
The cloud records the status in the db and decides whether this node needs migration.
Details of the status that a node reports to the cluster:
List<VmStatus> vmStatus; VM status, such as running, suspended, etc.
NodeStatus status; physical machine status, such as running, shutdown, ...
ReportMachineStatus nodeStatus; physical machine details (not reported every time), such as CPU used
List<ReportMachineStatus> vmInfos; VM status details (not reported every time), such as VM CPU used
List<ReportLoadBalancer> lbInfos; load balancer status
usedMemory; usedCpuRatio; usedDisk; totalDisk; totalMemory;
List<VmProgress> vmProgressList; VM creation progress
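The report-then-summarize flow can be sketched as a small accumulator on the cluster side. This is a simplified illustration: the class name StatusSummarizer is hypothetical, only usedCpuRatio is tracked, and the summary is a plain average per node over the period.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: collects frequent node reports and summarizes them once per period. */
final class StatusSummarizer {
    private final Map<String, List<Double>> cpuSamples = new HashMap<>();

    /** Called every few seconds when a node reports its status. */
    synchronized void report(String nodeId, double usedCpuRatio) {
        cpuSamples.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(usedCpuRatio);
    }

    /** Called once per period: returns the average CPU ratio per node and starts a new period. */
    synchronized Map<String, Double> summarizeAndReset() {
        Map<String, Double> averages = new HashMap<>();
        for (Map.Entry<String, List<Double>> entry : cpuSamples.entrySet()) {
            double sum = 0.0;
            for (double sample : entry.getValue()) {
                sum += sample;
            }
            averages.put(entry.getKey(), sum / entry.getValue().size());
        }
        cpuSamples.clear();
        return averages;
    }
}
```

Summarizing on the cluster means the cloud writes one row per node per period to the db instead of one row per report.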
cluster
VM migration process
Includes backup, delete, create and modify.
Three migration types:
1. Manual.
Migrate one service (one real server) or one VM onto another node manually.
2. After a node's death.
When the cluster finds that a node has died, it notifies the cloud. The cloud then gets the node info from the db and, for each VM on that node, selects the latest image the VM backed up and creates a new VM based on that image.
3. Periodic.
Occasionally there are not enough resources to migrate all VMs at once. So the cloud checks the db periodically, gets the list of "migrated, but failed" VMs and checks whether there are now enough resources to migrate them.
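The periodic retry can be sketched as one pass over the list of failed migrations. The sketch rests on assumptions: the class names MigrationRetry and PendingVm are hypothetical, and "enough resources" is reduced to a single free-memory figure, whereas a real check would also look at CPU and disk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch: retries "migrated, but failed" VMs when resources allow. */
final class MigrationRetry {
    /** Hypothetical record of a VM whose migration failed earlier. */
    static final class PendingVm {
        final String id;
        final long memoryMb;

        PendingVm(String id, long memoryMb) {
            this.id = id;
            this.memoryMb = memoryMb;
        }
    }

    /**
     * Removes from pending every VM that fits into the remaining free memory
     * and returns those VMs in the order they were picked.
     */
    static List<PendingVm> retry(List<PendingVm> pending, long freeMemoryMb) {
        List<PendingVm> migrated = new ArrayList<>();
        Iterator<PendingVm> it = pending.iterator();
        while (it.hasNext()) {
            PendingVm vm = it.next();
            if (vm.memoryMb <= freeMemoryMb) {
                freeMemoryMb -= vm.memoryMb;
                migrated.add(vm);
                it.remove(); // no longer pending
            }
        }
        return migrated;
    }
}
```

VMs that still do not fit stay in the pending list and are picked up again on the next periodic check.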
Threads in the cloud
Migration has to be fast, so threads must be used to fetch images, delete images, calculate space, etc.
Db
1. If you want to fetch all rows, do not order the result on the db-reader side; order it in the caller instead.
2. A query result from Hibernate cannot be serialized directly into a string via YAML. Deep-copy it first.
Yaml:
Cloud is a node
From the node and cluster point of view, the cloud is the top level. But from the elb-cloud and elb-cluster point of view, the cloud is a node.
[Diagram: cloud, elb]
Elb-cloud and elb-cluster
Elb also has three levels; the bottom level is the node level, and elb treats the cloud as its node.
Elb has its own db. Elb-cloud is responsible for db operations; elb-cluster is responsible for in-memory operations.
Communication between elb-cluster and cluster
The cloud tells the cluster the elb-cluster's IP and port, so when the cluster creates a load balancer on a node, it knows whom to report to.
Elb creates a realServer
Step 1: create a VM.
Step 2: add this VM to the cluster that elb controls.
Step 3: elb creates the realServer (not the node itself, and not Keepalived).
Pay for elb
Elb has its own billing system, handled by elb-cloud, based on VM template and count.
realServer deletion principle
If the VM was created by elb, the VM can be deleted when the elb is deleted.
Otherwise (the VM and the elb were created separately and the VM was added to the elb's cluster), the VM is not deleted along with the elb.
Why, why not
Typical design
Is there any single point of failure?
Theoretically, NO. VMs and VSs on a node can be migrated; the cluster uses local memory and can also get the info from the cloud.
The cloud records all info in the db. The db runs as a db cluster.
Why not use HDFS directly?
It has a single point of failure at the master node.
About the distributed lock: why ZooKeeper and not JGroups?
Admittedly, JGroups is a good tool: it is powerful and its group broadcast is easy to use.
But with complex data structures it is easy to write code that causes an infinite loop, and the code is hard to keep under control unless the developer is highly experienced.
About virtualization: why KVM and not Xen or VMware?
http://www.flexiant.com/hypervisor-comparison-kvm-xen-vmware-hyper-v/
KVM is the best choice for a solution that is free, open source, supports both Linux and Windows well, and is easy to learn and deploy.
About the db: why Postgres and not MySQL, Oracle or DB2?
When using Hibernate and general SQL requests, it does not matter which db is used.
Postgres is more stable than MySQL in multi-thread mode.
Oracle and DB2 are not free.
About the intranet: why a mixture of UDP and TCP and not just one of them?
For less important info, UDP is a good choice. Generally, some tolerance should be allowed for network lag.
For important info, TCP is reliable. Make sure each session is correct; if it is not, send a warning to the manager immediately.
For communication between nodes, sockets are a good choice: use Apache Thrift or a similar protocol to transfer the stream, and use SnakeYAML or JSON to translate strings into objects.